CS7641 - Machine Learning

Videos: https://omscs.gatech.edu/cs-7641-machine-learning-course-videos

Supervised Learning (SL)

Definition: Take examples of inputs and outputs. Now, given a new input, predict its output.

Classification vs Regression

  • Regression: If output is continuous
  • Classification: If output is a small, discrete set

Classification Learning

Terms

| Term | Definition |
| --- | --- |
| Instances | Inputs. Vectors of attributes that define your input space. |
| Concept | A function that takes the input space and maps it to an output. A concept is a mapping between objects in a world and membership in a set, which makes it a function. |
| Target Concept | The correct concept that maps instances to outputs. |
| Hypothesis Class | Set of concepts that you're willing to consider in order to find the target concept (e.g. binary or multiclass classifiers, but not regression functions). |
| Sample | Training set. Set of all of our inputs paired with the correct output labels. Foundation for inductive learning. |
| Candidate Concept | A concept that you think is the target concept (e.g. "anyone with curly hair is credit-worthy"). |
| Testing Set | Set of instances used to test a candidate concept. |

SL1: Decision Trees

Representation vs Algorithm

  • Decision Trees are a specific representation.
  • The algorithm is the method by which we build a decision tree.

Quiz: Running a decision tree

Work through each instance and give an answer based on the decision tree.

img

Note:

  • These examples are part of a test set.
  • The tree in the image is our candidate concept.
  • Some attributes don't matter and are thus never used.

Decision Tree Algorithm

"20 questions" example: We want to narrow down the possibilities with each question asked. This motivates us to ask very general questions that significantly narrow down the possibilities.

Learning Algorithm

  1. Pick the "best" attribute. "Best" is the attribute that best splits the training set (ideally roughly in half).
  2. Ask question
  3. Follow the answer path.
  4. Go to 1, until you get to an answer.

Quiz: Picking the "best" attribute

img

Notes:

  • Candidates 1 & 2 may be equally bad. 1 creates pretty much the same distribution after split. 2 doesn't split at all.

Expressiveness

Example: Boolean expressions as decision trees

img

Expressiveness per Tree Types

img

| Type | Size of Nodes | Description |
| --- | --- | --- |
| n-OR ("any" fn) | linear: $n$ | Easy to solve |
| n-XOR ("odd/even parity" fn)* | $O(2^{n})$ | Hard to solve. No clever way to pick the right attribute (need to look at all attributes). |

Takeaway: We hope that our ML problems are more like n-OR problems rather than n-XOR problems because n-XOR problems are hard and expensive.

* If number of attributes that are true is odd, then the output is true.

Expressiveness Continued

img

Q: Exactly how expressive is a decision tree?

Given:

  • n attributes (boolean) - $O(n!)$ (combinatorial) ways to arrange them as nodes
  • outputs is boolean
  • Q1: If mapped to a truth table, how many rows are needed?
  • A1: $2^n$ rows
  • Q2: How many ways to fill in outputs (outputs are boolean)
  • A2: $2^{2^n}$ (really big number, grows very fast)

A: The hypothesis space of decision trees is very expressive. We need to cleverly search this space to efficiently find the best representation.

ID3 Algorithms

Loop:
    1. Pick best attribute A
    2. Assign A as decision attribute for Node
    3. For each value A, create a descendant of Node
    4. Sort training examples to leaves
    5. Stop if examples perfectly classified
    6. Else iterate over leaves

Using Information Gain to Choose Best Attribute

img

Definition: Reduction in randomness over the labels of a set of data, based upon knowing the value of a particular attribute.

Entropy: "Average amount of information conveyed by an event, when considering all possible outcomes"

  • Intuition: Measure of the amount of information. Reducing entropy means we are gaining more information. Goal is to maximize information gain by choosing attribute that lowers entropy the most.
  • Example: Fair vs. two-headed coin. A fair coin has a 50/50 chance of heads, giving it 1 bit of entropy. A two-headed coin has a 100% chance of heads, giving it zero entropy (no information).
    • The more of one label you have over another, the amount of entropy goes down.

Formula:

  • $S$: Collection of training examples
  • $A$: Specific attribute
  • $v$: Specific values of an attribute (categorical, continuous)
$$ Gain(S,A) = Entropy(S) - \Sigma_v\frac{|S_v|}{|S|} Entropy(S_v) $$

Formula for entropy: $$ Entropy(S) = -\Sigma_v p(v) \log_2 p(v) $$
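The two formulas above can be sketched in a few lines of Python (the toy dataset and attribute names below are made up for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_v p(v) * log2 p(v) over the label values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(x[attribute] for x in examples):
        subset = [y for x, y in zip(examples, labels) if x[attribute] == v]
        gain -= len(subset) / n * entropy(subset)
    return gain

# Toy data: attribute "a" perfectly predicts the label, "b" is useless.
X = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 0}, {"a": 1, "b": 1}]
y = [0, 0, 1, 1]
print(information_gain(X, y, "a"))  # 1.0 (perfect split)
print(information_gain(X, y, "b"))  # 0.0 (no information)
```

ID3's "best attribute" step is just `max` over attributes by this gain.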

ID3: Inductive Bias

Types of bias:

  1. Restriction bias: Hypothesis set that you care about.
    • For DT, includes all possible DT representations.
    • Does NOT include functions outside of DT (quadratic, etc).
  2. Preference bias: What sorts of hypotheses in the hypothesis set ($h \in H$) we prefer.
    • Inductive bias == preference bias

ID3's inductive bias - Types of preferences:

  • Prefers good splits near the top.
  • Prefers representations that return correct answers over incorrect ones.
  • Prefers shorter trees to longer trees. This comes naturally from preferring good splits near the top.

Decision Trees: Other Considerations

img

Continuous Attributes

Q: How to build DTs with continuous attributes?

  • A: Create a binary branching decision based on continuous ranges.

Q: T/F, does it make sense to repeat an attribute along a path in a tree?

  • A1: False for discrete value attributes because there is no information gain asking the same question.
  • A2: True for continuous attributes, because we can ask different questions based on the same attribute. Nodes further down the branch can use different ranges to split.

When to stop

Overfitting: Don't trust the data too much because there is potential for noise. This happens when trees are too big.

Solutions

  • Validation: Use cross-validation, or use held out validation set to verify if we want to continue branching.
  • Pruning: Build the full tree first, then prune it back based on how much error the pruning would cost (based on a validation set).
    • The pruned node's output is based on the mode

How to build DT using regression?

Regression implies continuous outputs. Information gain doesn't work well with continuous values.

Q: What's the splitting criteria for regression trees?

  • A: You can split by variance to decide.

Q: What's the output?

  • A: You can use the average

Summary

img

SL2: Regression

Regression & Function Approximation

img

Etymology: "Regression to the mean"

  • A slope of less than 1 (ie. 2/3) represents many natural phenomena, like parent/child height relations.
  • In ML, "regression" has come to mean using a functional form to approximate a set of data points.

Regression in ML

Finding the best line (measured by least squared error) is achieved through calculus (ie. setting the derivative of the error to zero).

Finding the best constant function

img

  • The above image shows how to derive the best constant function for the squared sum of errors; setting the derivative to zero finds the minimum.
  • In this case, the best constant is the average of all the $y$ values.
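A quick numerical sanity check of that claim (the toy $y$ values are assumed for illustration): among nearby candidate constants, the mean gives the smallest sum of squared errors.

```python
ys = [1.0, 4.0, 7.0, 10.0]

def sse(c, ys):
    """Sum of squared errors for the constant prediction c."""
    return sum((y - c) ** 2 for y in ys)

mean = sum(ys) / len(ys)  # 5.5

# Scan constants around the mean; none beats the mean itself.
candidates = [mean + d / 10 for d in range(-20, 21)]
best = min(candidates, key=lambda c: sse(c, ys))
print(best)  # 5.5
```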

Polynomial Regression

Choosing the Order of Polynomial

img

  • The greater the order of polynomial used, the closer we can fit the data at the cost of sensitivity to noise (ie. overfitting).

Math: Polynomial Regression

img

img

  • Each $x$ (data point) is raised to each power up to the chosen order. This builds a matrix with one row per data point.
  • This matrix is then multiplied by the vector of coefficients, one per power.
  • Each row maps to one $y$ response.
  • $w$ is the vector of coefficient weights that gives the best fit.
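A minimal sketch of that matrix setup, assuming NumPy is available; the toy data is generated from a known quadratic so the recovered weights can be checked:

```python
import numpy as np

# Data generated from y = 1 + 2x + 3x^2 (no noise), so a degree-2 fit
# should recover the coefficients.
xs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
ys = 1 + 2 * xs + 3 * xs ** 2

order = 2
# Each row is [1, x, x^2]: one data point raised to each power.
X = np.vander(xs, order + 1, increasing=True)
# Least-squares solution of X w = y (equivalently the normal equations).
w = np.linalg.lstsq(X, ys, rcond=None)[0]
print(w)  # approximately [1. 2. 3.]
```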

Errors, and where they come from

Training data has error that doesn't model $f$, results in $f+\epsilon$

Types of error sources:

  1. Sensor errors
  2. Malicious data
  3. Transcription error
  4. Unmodeled influences that are hard to capture (ie. fluctuations in housing prices, time of day, buyer's mood)

Cross Validation

Definition: Split the training data into n folds, then use one fold as the "test" data and train on the rest. Repeat with each fold. Choose the model with the lowest average error.
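The fold-splitting step can be sketched as pure index bookkeeping (the learner and error measurement are left out):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) for k folds over n examples."""
    fold_size = n // k
    indices = list(range(n))
    for i in range(k):
        test = indices[i * fold_size:(i + 1) * fold_size]
        train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
        yield train, test

# 6 examples, 3 folds: each example is held out exactly once.
folds = list(k_fold_splits(6, 3))
print(folds[0])  # ([2, 3, 4, 5], [0, 1])
```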

Aside - Assumptions of SL:

  • Train/test sets are IID (independent and identically distributed)

Applied to housing example in video:

  1. Gap between train/cross-val error narrows up to 3-4th degree
  2. Higher order polynomials begin overfitting to the training data at the expense of generalization on unseen examples.

img

Other Input Spaces

img

  • Regression can handle discrete, vector or scalar attributes.
  • Handling categorical attributes: enumerate them (OK) or 1-hot encode them (better) - enumeration imposes an artificial ranking on the categories.

Summary

img

SL3: Neural Networks

video

Example of a perceptron (binary classifier with 3 input + weights)

img

Quiz: Example inputs and estimated y

img

How powerful is a perceptron unit?

Image: Representation of activation space given 2 data + weights. Generates a halfplane.

img

Takeaway: Perceptrons are linear functions that compute a hyperplane.

Perceptron Quizzes

Quiz 1: If we focus on $x_1 \in \{0,1\}$ and $x_2 \in \{0,1\}$, what is $y$?

  • A: AND Logic

Quiz 2: If $y$ is an OR function, what would we set $w_1$, $w_2$, $\theta$?

  • A: (Many answers) e.g. $w_1 = 0.5$, $w_2 = 0.5$, $\theta = 0.5$, which lowers the threshold so that either input alone is enough to activate.

Quiz 3: Given a NOT logic (one variable), what should we set $w_1$ and $\theta$?

  • A: $w_1 = -1$ (turns 0 into 0, 1 into -1), and $\theta = 0$, so only input 0 produces a non-negative activation and fires.

Takeaway: AND/OR/NOT logic can each be expressed as a single perceptron unit; XOR requires a small network of them.

XOR as Perceptron Network Quiz

img

In [16]:
# Perceptron Quiz answer in code

def xor_network(x1, x2, w1, w2, w_and, theta) -> int:
    # Third input is the AND of x1 and x2, weighted by w_and.
    activation = x1*w1 + x2*w2 + bool(x1 and x2)*w_and
    return 1 if activation >= theta else 0

def test_xor(w1, w2, w_and, theta): 
    assert xor_network(0,0, w1, w2, w_and, theta) == 0
    assert xor_network(0,1, w1, w2, w_and, theta) == 1
    assert xor_network(1,0, w1, w2, w_and, theta) == 1
    assert xor_network(1,1, w1, w2, w_and, theta) == 0
    print("Passed all tests")

w_1 = 1
w_2 = 1
w_and = -2
threshold = 1   
test_xor(w_1, w_2, w_and, threshold)
Passed all tests

Answer to XOR question:

  • We essentially want the network to behave like OR logic, except when the AND unit fires. The previous quiz showed that OR logic can be created by setting $w_1$ and $w_2$ to 1 and $\theta$ to 1.
  • We can therefore add a negative weight that counteracts the OR logic by setting $w_\text{and}$ to -2, which offsets $x_1 w_1 + x_2 w_2$ and gives us 0 when both inputs are 1. img

Perceptron Training

Definition: Given examples, find weights that map inputs to outputs.

2 ways in which perceptron training is done:

  1. Perceptron Rule: Using the threshold ($\theta$)
  2. Gradient Descent / Delta Rule: Using the "unthreshold" value.

Perceptron Rule

Goal: How to set the weights of a single unit so that it matches a training set.

img

Components:

  • $x$: input (single unit for this example)
  • $w_i$: Weight of $i$-th unit
  • $y$: Target
  • $\hat{y}$: Prediction
  • $\eta$: Learning rate. Controls the size of each weight update.
  • $\theta$: A math trick folds the threshold into the weights as a bias input fixed at 1, so we don't have to treat the threshold separately.

(1) Weight change formula: Changing weight of $w_i$ by $\Delta w_i$. This is what we iterate on while there is error.

$$ w_i = w_i + \Delta w_i $$

(2) Defining weight change amount - First, calculate $\hat{y}$.

  • If y and $\hat{y}$ are equal, then we don't need to change the weight.
  • If they are not equal, then we get either a positive or negative value.
$$ \hat{y} = (\Sigma_i w_i x_i \ge 0) $$

(3) Determining the weight change needed based on distance between $y$ and $\hat{y}$. This is multiplied by the unit ($x$) value, counterbalanced by the learning rate $\eta$.

$$ \Delta w_i = \eta (y-\hat{y})x_i $$
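Putting (1)-(3) together, a minimal perceptron-rule training loop might look like this (the AND dataset and hyperparameters are illustrative, not from the lecture):

```python
def predict(w, x):
    """Threshold the weighted sum; a bias input fixed at 1 absorbs theta."""
    xb = list(x) + [1]
    return 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else 0

def train_perceptron(data, eta=0.1, epochs=50):
    """Perceptron rule: w_i <- w_i + eta * (y - y_hat) * x_i."""
    w = [0.0] * (len(data[0][0]) + 1)   # last weight plays the role of the bias
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            y_hat = predict(w, x)
            if y_hat != y:
                mistakes += 1
                xb = list(x) + [1]
                w = [wi + eta * (y - y_hat) * xi for wi, xi in zip(w, xb)]
        if mistakes == 0:               # converged: training set fit exactly
            break
    return w

# Linearly separable AND data: the rule converges in finite time.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(and_data)
print([predict(w, x) for x, _ in and_data])  # [0, 0, 0, 1]
```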

Linearly Separable & Perceptron Rule

Definition: All examples can be separated by a linear hyperplane. This gets more complicated to visualize with 4+ dimensions.

Perceptron Rule: If we have data that is linearly separable, then the perceptron rule will find this hyperplane.

Q: What if the data is not linearly separable?

  • A: Continue until iteration limit is reached (not easy to figure out).

Gradient Descent

Motivation: More robust to linearly non-separable datasets.

  • Key is to not use thresholding.

img

Components:

  • $\alpha$: Activation, the weighted sum $\Sigma_i w_i x_i$
  • $y$: Target
  • $\hat{y}$: Estimated output, $\{\alpha \ge 0\}$
  • $E(w)$: Error metric

(1) Calculate the error metric. Goal is to minimize this error.

  • $\frac{1}{2}$ is just a helper to make deriving the partial derivative easier.
$$ E(w) = \frac{1}{2} \Sigma_{(x,y)\in D} (y-\alpha)^2 $$

(2) Find the partial derivative of above.

  • Why? We do this to find the best weights to minimize the error.
$$ \frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \Sigma_{(x,y)\in D}(y-\alpha)^2 $$

(3) We get the result - The derivative of the error $E$ with respect to $w_i$. The result looks pretty similar to the perceptron rule.

$$ \frac{\partial E}{\partial w_i} = \Sigma_{(x,y) \in D} (y-\alpha)(-x_i) $$
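Stepping opposite this derivative gives the batch update $\Delta w_i = \eta \Sigma_{(x,y)\in D} (y-\alpha) x_i$, sketched below (the toy linear dataset is an assumption for illustration):

```python
def gd_step(w, data, eta=0.05):
    """One batch delta-rule update: w_i <- w_i + eta * sum_D (y - a) * x_i,
    where a = sum_i w_i x_i is the unthresholded activation."""
    grads = [0.0] * len(w)
    for x, y in data:
        a = sum(wi * xi for wi, xi in zip(w, x))
        for i, xi in enumerate(x):
            grads[i] += (y - a) * xi     # negative gradient of E(w)
    return [wi + eta * g for wi, g in zip(w, grads)]

# Targets generated from y = 2*x1 - 1*x2, so the weights should head
# toward (2, -1).
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0),
        ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
w = [0.0, 0.0]
for _ in range(500):
    w = gd_step(w, data)
print(w)  # approximately [2.0, -1.0]
```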

Comparison of learning rules

| Type | Learning Rule | Pros | Cons |
| --- | --- | --- | --- |
| Perceptron Rule | $\Delta w_i = \eta (y-\hat{y})x_i$ | Guaranteed finite convergence | Requires linear separability |
| Gradient Descent | $\Delta w_i = \eta (y-\alpha)x_i$ | More robust to non-linearly separable data | May converge only to a local optimum |

Quiz: Why not do gradient descent on $\hat{y}$?

  • Answer: Because $\hat{y}$ comes from a discontinuous threshold (step) function, it is non-differentiable. Example of how this looks on a 2d plot below.
  • Significance: We can use a sigmoid instead, which smooths the discontinuous, disjointed line into a differentiable curve.

img

Sigmoid Function - Differentiable Threshold

img

(1) Sigmoid activation function $$ \sigma(x) = \frac{1}{1+e^{-x}} $$

(2) Derivative of the sigmoid function $$ \sigma'(x) = \sigma(x)(1-\sigma(x)) $$

  • As the activation ($\alpha$) goes to infinity, $\sigma(\alpha)$ goes to 1
  • As it goes to negative infinity, it goes to 0
  • At zero, $\sigma(0) = \frac{1}{1+e^{0}} = 0.5$
  • Between about -5 and 5 it forms a smooth curve, which makes gradient descent possible (ie. a continuous, differentiable line).

Extra: How to take the derivative of the sigmoid function. Reference: TDS
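The sigmoid and its derivative identity are easy to check numerically (the finite-difference comparison below is just a sanity check, not from the lecture):

```python
from math import exp

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^(-x))."""
    return 1 / (1 + exp(-x))

def sigmoid_derivative(x):
    """sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1 - s)

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1 (saturates)

# Numerical check of the derivative identity via central differences:
h = 1e-6
numeric = (sigmoid(1 + h) - sigmoid(1 - h)) / (2 * h)
print(abs(numeric - sigmoid_derivative(1)) < 1e-8)  # True
```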

The Illustrated Neural Network

Key Points:

  • Each node is a sigmoid unit.
  • Each node is differentiable, so the entire network becomes differentiable.
  • Backpropagation: The act of sending the error information from the output back to the input, while updating the weights of each activation node in between.
    • "Computationally beneficial organization of the chain rule"
  • Local minima issue: High dimensional complex problems can have many local minima, hence the result of gradient descent can get stuck in the local but not global minima.

Advanced Methods to Optimize Weights

Alternatives/supplements to vanilla gradient descent:

  1. Momentum: Overcome noisy gradients (ie. local minima) by adding an additional hyperparameter that controls the amount of history (momentum) to include in the update equation.
    • A large momentum (e.g. 0.9) will mean that the update is strongly influenced by the previous update, whereas a modest momentum (0.2) will mean very little influence
  2. Higher order derivatives: The gradient is the first-order derivative for a multivariate objective function. Instead, we can use the second/third order derivative.
  3. Randomized optimization: Covered in future sections.
  4. Penalty for "complexity": Add penalty term for adding more nodes and layers, and/or large numbers.
    • Larger numbers are more likely to overfit, because they let you represent arbitrarily complex functions.

Restriction Bias

Definition: Representational power of the data structure you're using. Tells you the set of hypotheses you're willing to consider.

  • Lower bias == more hypotheses considered and vice versa.
  • Relevant to all supervised learning algorithms (not just neural nets).

Types of neural nets and their representational power:

  1. Perceptron: Linear function. Only considers planes.
  2. Perceptron network: Introduces boolean logic (XOR, etc). Starts capturing nonlinear functions.
  3. Sigmoid: More complex, less restriction
  4. Continuous function: Can be handled by a network with one hidden layer (given enough hidden units).
  5. Arbitrary function: ie. discontinuous functions. Can be represented with two hidden layers.

Preference Bias

Definition: The algorithm's selection (ie preference) of one representation over another.

Preference bias of neural nets: small random values that provide simpler explanations.

- SMALL: Favors small values, because networks with small weights have low(er) complexity. Larger values tend to overfit.
- RANDOM: Variability. Possibility of exploring outside of local minima.
- Similar to __Occam's Razor__: "Entities should not be multiplied unnecessarily"
    - The algorithm prefers stopping once it fits the data well, before letting the weights grow larger.

Summary of Neural Nets

  1. Perceptrons - threshold unit
  2. Networks can produce any boolean function.
  3. Perceptron rule - Finite time for linearly separable.
  4. General differentiable rule - Backpropagation and gradient descent.
  5. Preference / restriction bias of neural networks.

SL4: Instance Based Learning

| Pros | Cons |
| --- | --- |
| Remembers (no catastrophic forgetting) | Cannot generalize |
| Fast (no learning) | Slower at runtime |
| Simple (Occam's Razor) | Easily overfits (remembers noise as well) |
| | Struggles with ambiguity (e.g. two identical x's with different y's, multiple equally-near neighbors with different y values) |

K-Nearest Neighbor Algorithm

Given:

  • Training Data (set of pairs): $D=\{x_i, y_i\}$
  • Distance metric: $d(q, x)$
  • Number of neighbors: $K$
  • Query point (new data): $q$

(1) Find the $K$ nearest neighbors of $q$ $$ \text{NN} = \{\, i : d(q, x_i)\ \text{among the}\ k\ \text{smallest} \,\} $$

(2.1) Classification: Return the (weighted) vote of $y_i \in \text{NearestNeighbor}$ (ie. Mode)

(2.2) Regression: Return the (weighted) mean of $y_i \in \text{NearestNeighbor}$
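Steps (1)-(2.2) as a compact sketch (Euclidean distance, unweighted voting, and the toy dataset are assumptions for illustration):

```python
from collections import Counter
from math import sqrt

def knn_predict(D, q, k, mode="regression"):
    """k-NN: find the k training points nearest to q, then return their
    mean (regression) or mode (classification)."""
    dist = lambda a, b: sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(D, key=lambda pair: dist(pair[0], q))[:k]
    labels = [y for _, y in neighbors]
    if mode == "regression":
        return sum(labels) / len(labels)
    return Counter(labels).most_common(1)[0][0]

D = [((0.0, 0.0), 1.0), ((1.0, 0.0), 2.0),
     ((0.0, 1.0), 2.0), ((5.0, 5.0), 10.0)]
print(knn_predict(D, (0.1, 0.1), 3))                         # mean of 3 nearest
print(knn_predict(D, (0.1, 0.1), 3, mode="classification"))  # mode: 2.0
```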

Quiz 1: Runtime vs Space

Takeaways:

  1. KNN is faster to learn (learning time is constant, space is linear), but slower at querying (query time is $O(\log n)$ on sorted data, constant space).
    • If you query $n$ times overall, it can end up less efficient than linear regression.
    • Lazy learner: Essentially only does work at query time.
  2. Linear regression is slower to learn (learning time is linear, space is constant), but faster at querying (constant time and space).
    • Eager learner: Wants to learn right away.

Quiz 2: Testing KNN on Regression

How to calculate Manhattan/Euclidean distance: Both are derived from the Minkowski distance ("p-norm distance") with different values of $p$:

  • Given:
    • $x$: Input
    • $y$: Target
    • $p$: Order (1=Manhattan, 2=Euclidean)
  • Minkowski distance: $d(x,y) = \sqrt[p]{\sum_{i=1}^{n}|x_i - y_i|^p}$
  • Manhattan distance: $d(x,y) = \Sigma_{i=1}^n |x_i - y_i| $
  • Euclidean distance: $d(x,y) = \sqrt{\Sigma_{i=1}^n|x_i - y_i|^2}$

Using the first row as an example (x=(1,6), y=7):

  • Manhattan distance: $d(x,q) = |1-4| + |6-2| = 3 + 4 = 7$
  • Euclidean distance: $d(x,q) = \sqrt{|1-4|^2 + |6-2|^2} = \sqrt{9+16} = 5$
    • NOTE: In the video they use squared Euclidean distance for readability.
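The worked example above can be checked with a small Minkowski helper:

```python
def minkowski(x, y, p):
    """p-norm distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, q = (1, 6), (4, 2)
print(minkowski(x, q, 1))  # Manhattan: |1-4| + |6-2| = 7.0
print(minkowski(x, q, 2))  # Euclidean: sqrt(9 + 16) = 5.0
```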

Finding 1/3-NN:

  • How to find 1-NN: Find the lowest distance. If more than one point ties, average their $y$ values.
  • How to find 3-NN: Average the $y$ values of the 3 lowest distances.
  • Manhattan:
    • 1-NN distance = 4. 2 tied datapoints, (2,4), 8 and (7,1), 50. $y=\mu{(8, 50)}=29$
    • 3-NN distances = [4,6]. 4 datapoints (checked in the MD column). $y=\mu{(8, 16, 50, 68)} = 35.5$
  • Euclidean:
    • 1-NN distance = 8. 1 datapoint, (2,4), 8. $y=8$
    • 3-NN distances = [8, 10, 20]. 3 datapoints (the ones checked in the ED column in the image). $y=\mu{(8, 50, 68)} = 42$

Error:

  • Target function: $y=x_1^2+x_2$
  • Given $q=(4,2)$, $y=18$

Takeaways:

  • KNN, using both Euclidean and Manhattan distance, were unable to predict anywhere near the target.
  • Prediction of KNN is heavily dependent on 1) How many neighbors are used, and 2) The distance metric used.
  • This implies an underlying bias of KNN.

KNN's Preference Bias

  1. Locality: Near points are similar (e.g. real-estate example)
    • Embedded in the distance function. Why would a specific distance function be better?
    • There is always one best distance function for a specific problem (given all else is fixed).
  2. Smoothness: Expects smoothly changing behavior between data points
  3. All features matter equally (e.g. in the Minkowski distance) - but this is not always the case
    • For example, with $y=x_1^2+x_2$, a change in $x_1$ causes a much larger change in $y$ than the same change in $x_2$.

Curse of Dimensionality

Definition: As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.

Example: Number of neighbors needed to cover the same dimension space.

  1. 1D: Each point covers 1/10 of the space
  2. 2D: Each point covers 1/100 of the space
  3. 3D: Each point covers 1/1000 of the space

Some Other KNN Stuff

Distance metric

  • Your choice of distance function matters. Choosing the wrong distance function will yield poor behavior.
  • Euclidean / Manhattan distances are useful for regression. Why? Because they assume continuous, numerical features.
  • Use domain knowledge to decide which distance function makes sense. More info on how to decide in this link.

Number of neighbors + weighted average

  • What if $K=n$? You end up with a constant function, ie. the simple average of all $y_i$.
  • Weighted averages: Instead of weighting all neighbors equally, put more weight on those closer to the query.
  • Locally weighted regression: Instead of averaging, take the data points closest to $q$ and fit a line to them.
  • Takeaway: KNN is useful, because it allows you to take local information and build arbitrary functions/concepts.

Example: Locally weighted linear regression

Summary of Instance-based Leaners

  • Lazy vs eager learners
  • KNN
  • Nearest neighbor: similarity (ie. distance) functions
  • NOTE: Both the number of neighbors and the similarity function are ways to capture domain knowledge.
  • Classification vs Regression (KNN can handle both)
  • Averaging (non-weighted vs weighted)
  • Combining local knowledge to fit a regression line.
  • Curse of dimensionality
  • Domain knowledge matters - this guides you to choosing which algorithm will work best.

SL5: Ensemble Learning: Bagging and Boosting

Bagging

Idea: Build up a bunch of rules and combine them until you get something that's good enough.

Pseudocode:

1. Learn over a subset of data (uniform & random) and apply a learner
...(repeat)
2. Combine all the rules (How? Regression: Mean, Classification: Mode)
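The subsample-and-combine structure above can be sketched as follows; the "learner" here is a stub that just predicts the mean label of its bootstrap sample (a real learner, e.g. a tree, would plug in the same way):

```python
import random

def bagging_regression(D, n_learners=50, seed=0):
    """Bagging sketch. Each learner trains on a bootstrap sample drawn
    uniformly at random with replacement; the stub learner predicts the
    mean label of its sample. Predictions are combined by mean
    (classification would combine by mode instead)."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_learners):
        sample = [rng.choice(D) for _ in D]            # bootstrap subset
        labels = [y for _, y in sample]
        predictions.append(sum(labels) / len(labels))  # stub learner
    return sum(predictions) / len(predictions)         # combine by mean

# Toy data: labels 0..9, true mean 4.5; the bagged estimate lands close.
D = [((float(i),), float(i)) for i in range(10)]
print(bagging_regression(D))  # close to 4.5
```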

Example: Spam Email

Quiz: Ensemble Output

  • This is essentially the same as running KNN where $K=n$.

Bagging (Bootstrap Aggregation)

Takeaway:

  • Bagging averages out variance/noise in the data by fitting multiple learners to subsets of data.
  • A higher-order regression will fit all the data closely, with more chance of overfitting.
  • In this case, the single regression fits the training data better but does worse on test data. Bagging is the reverse, meaning it generalizes better.

Boosting

Pseudocode (high-level):

1. Learn over a subset of data and apply a learner
2. Filter down to the "hardest" examples, ie. the data points that result in the highest errors.
...(repeat)
3. Combine all the rules (How? Regression: WEIGHTED mean, Classification: WEIGHTED Mode)

Defining "hardest" examples

Classification error = # of mismatches

Given:

  • $x$: Datapoint
  • $h$: Hypothesis (from learner)
  • $c$: Ground truth
  • $D$: Distribution (of examples being drawn, we don't know how they're being drawn though)
$$ Pr_D [h(x) \ne c(x)] $$

"Weak" Learner

Definition: A learning algorithm that, under any distribution, will find a hypothesis that performs better than chance on the current distribution.

  • Error: No matter what distribution you're given ($\forall$), you will always get an error rate lower than a half (minus a very small number $\epsilon$).
  • Takeaway: You're always learning "something" from the learner (ie. better than chance).

"Does better than chance" expressed more formally: for all distributions, the learner's expected error (the probability that the hypothesis disagrees with the true target on a sample) is at most half minus epsilon:

$$ \forall_D\ Pr_D [h(x) \ne c(x)] \leq 1/2 - \epsilon $$

Quiz: Weighted Error

Quiz: Weak Learners (TODO: Come back to this one - I don't understand it)

  1. Find the distribution over 4 different examples ($x_i$) such that a learning algorithm that has to choose between one of the 3 hypotheses ($h_i$) will be able to find one that does better than chance (ie. expected error less than half).
  2. Find a distribution over the 4 examples, a learning algorithm (looking at $x_{1,2,3}$) would not be able to return one of them that does better than chance.

ANSWER:

  1. Good distribution: If we give equal weight to each example, $h_1$ gets 3 of 4 correct, which is better than chance. Its error is 1/4.
  2. Evil distribution: I don't understand how they got [1/2, 1/2, 0, 0]?

Some explanation in Slack thread:

One can ask a simple question: given a set of objects, X, what is the probability of seeing any particular object, x, that is drawn from X according to some distribution D? By definition, D, induces a value for each x that is between 0 and 1 inclusive where the sum of all these values is 1.

In our case, each x has some corresponding label; let's call each one y.

A weak learner is simply a learning function that given X, D, Y, and a set of hypotheses, H, will find some hypothesis, h, drawn from H, such that the probability of drawing an x such that h(x) == y is better than chance.

Boosting Algorithm (Detailed)

Pseudocode (left side)

  1. For a given training set, assume that each target has a value of either -1 or +1 (binary class)
  2. Enter a loop for $T$ times ($t$ is each time step)
    1. Construct a distribution over your examples ($D_t$)
    2. Find a weak classifier ($h_{t}(x)$) with a small amount of error ($\epsilon_t$) given a specific $D_t$. (This is where we iteratively learn the error from the previous learners).
      • The error $\epsilon_t$ is expected to be smaller than chance (ie. below 50%)
  3. Output a final hypotheses $H$

Constructing the Distribution (right side)

  1. Base case: $D_{1}(i)$ is initialized with a uniform distribution.
    • Why? We don't know whether one example is more important than another, so just assume they're all equal.
  2. At each time step, we construct a new distribution over the examples ($D_{t+1}(i)$):
    • The big equation shows how the new distribution is calculated. The important part: for a binary classification problem, $y_i h_t(x_i)$ is 1 if $y$ and $h$ agree and -1 if they disagree.
    • $\alpha_t$ is a positive constant (derived from the error $\epsilon_t$) that multiplies the agreement term.
    • If $y$ and $h$ agree, the distribution weight for that example ($D(i)$) decreases
      • Math: Agreement ($y_i = h_t(x_i)$) gives 1, which is multiplied by $\alpha_t$ then negated; $e$ raised to a negative exponent gives a number less than 1 ($e^{-n} = 1/e^n$).
    • The denominator $Z_t$ normalizes the result so $D_{t+1}$ sums to 1
  3. Significance: For each weak learner, it puts more weight on examples that it gets wrong and less weight on those it gets right.
    1. If $h_t$ agrees on one example but disagrees on at least one other, $D_t(i)$ for the agreed example will DECREASE.
    2. If $h_t$ disagrees on one example but agrees on at least one other, $D_t(i)$ for that example will INCREASE.
    3. If each weak learner always has an error rate less than chance, the new weak learner must get right some of the examples that the learner at $t-1$ was getting wrong.

Constructing the final hypothesis

We generate the final hypothesis by summing each $\alpha_t$ times its hypothesis $h_t(x)$, then running the sum through a sign function (which returns the sign of the number: -1, 0, or 1).

Formula for final hypothesis $H$:

$$ H_{final}(x) = \text{sgn}\left(\Sigma_t \alpha_t h_t (x)\right) $$
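The whole loop, distribution update, and final hypothesis can be sketched as follows (threshold "stumps" as the weak hypotheses and the 2-D toy dataset are assumptions for illustration, not from the lecture):

```python
from math import exp, log

def adaboost(X, y, stumps, T=3):
    """AdaBoost sketch for labels in {-1, +1}. `stumps` is a list of
    candidate weak hypotheses (functions x -> -1/+1); each round the weak
    learner picks the stump with the lowest weighted error under D_t."""
    n = len(X)
    D = [1 / n] * n                          # D_1(i): uniform to start
    ensemble = []                            # list of (alpha_t, h_t)
    for _ in range(T):
        err = lambda h: sum(d for d, x, yi in zip(D, X, y) if h(x) != yi)
        h = min(stumps, key=err)             # weak learner
        eps = err(h)
        if eps == 0 or eps >= 0.5:           # perfect, or no better than chance
            break
        alpha = 0.5 * log((1 - eps) / eps)
        ensemble.append((alpha, h))
        # D_{t+1}(i) is proportional to D_t(i) * e^(-alpha * y_i * h_t(x_i)):
        # wrong examples are weighted up, correct ones down; Z_t renormalizes.
        D = [d * exp(-alpha * yi * h(x)) for d, x, yi in zip(D, X, y)]
        Z = sum(D)
        D = [d / Z for d in D]
    # Final hypothesis: H(x) = sgn(sum_t alpha_t * h_t(x))
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# 2-D toy set where no single axis-aligned stump is perfect, but three
# boosted stumps classify every point correctly.
X = [(1, 1), (1, 3), (3, 1), (3, 3), (5, 1), (5, 3)]
y = [1, 1, -1, 1, -1, -1]
stumps = [lambda p, a=a, t=t, s=s: s * (1 if p[a] < t else -1)
          for (a, t) in [(0, 2), (0, 4), (1, 2)] for s in (1, -1)]
H = adaboost(X, y, stumps)
print([H(p) for p in X])  # [1, 1, -1, 1, -1, -1] (matches y)
```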

Boosting Example

Boosting Intuition

  • Each weak learner does better than chance.
  • The new distribution weights examples differently depending on whether the weak learner at time $t$ got them right: correct examples are weighted down and incorrect ones are weighted up.
  • New learner at $t+1$ will now focus on getting those that the previous learner got wrong.
  • The sum of alpha and hypothesis gives our final hypothesis $H$.
  • There is a property of information gain, that each iteration must always be better than chance w.r.t error.

Summary

  • Bagging: Subsample bunches of examples and train a different learner on each. Merge them with the mean or mode (depending on regression/classification).
  • Combining simple learners give you ensemble learner that can fit complex data.
  • Boosting is really good.
  • Ensemble learners are agnostic to what type of weak learners to use.
  • Error is related to distribution $D$ for Boosting.
  • Overfitting: In practice, boosting is resilient to overfitting.

SL6: Kernel Methods & SVMs

  • Assumption: Dealing with linearly separable data + binary labels.
  • The middle line explains the data correctly while being the least committed to the sample set. This prevents overfitting.

SVM

Supporting material/notes for this section: My notes from ISYE6501

Linear Classifiers

Given:

  • $y$ - label
  • $w$ - parameters of the hyperplane
  • $b$ - bias
$$ y = w^T x + b $$

Decision Boundaries and Margins:

  • The decision boundary is where $w^T x + b = 0$ (this is the middle line).
  • Margin planes (assuming binary classification with labels 1 and -1 + linearly separable data):
    1. Hyperplane for the positive class: $w^T x+b=1$
    2. Hyperplane for the negative class: $w^T x+b=-1$
  • We want to maximize the margin, ie. the distance between the two margin planes.

Quiz: Solving for the margin

Q: Solve for the line that is described by their difference (...?)

Answer:

  • Essentially asking for $w^T(x_1 - x_2)$, which is 2 (given classes 1 and -1)
  • But we want the vector form - with some algebra, it is $\frac{2}{||w||}$ (bottom right)
  • Explanation: We are solving for the parameters of the two hyperplanes that maximize their distance (ie. maximize the perpendicular distance between the parallel planes through $x_1$ and $x_2$), with the constraint that all classifications are still correct. This distance is called the margin.
  • tl;dr - SVM's goal is to find a decision boundary that maximizes the margin, subject to the constraint that all points are correctly classified
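A toy numerical check of the $\frac{2}{||w||}$ margin (the hyperplane parameters below are made up for illustration):

```python
from math import sqrt

# Toy hyperplane w.x + b = 0 in 2-D with w = (2, 0), b = -2: the
# boundary is x1 = 1, and the margin planes w.x + b = +/-1 sit at
# x1 = 1.5 and x1 = 0.5.
w, b = (2.0, 0.0), -2.0

def decision(x):
    """w^T x + b; the sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

norm_w = sqrt(sum(wi ** 2 for wi in w))
margin = 2 / norm_w
print(margin)  # 1.0: the distance between the planes x1 = 0.5 and x1 = 1.5

# Support vectors lie exactly on the margin planes:
print(decision((1.5, 7.0)))   # 1.0 (positive-class plane)
print(decision((0.5, -3.0)))  # -1.0 (negative-class plane)
```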

SVM as an Optimization Problem

Idea: We can turn the $\max \{ \frac{2}{||w||}\}$ problem into an equivalent quadratic program.

  • Why? Quadratic programming problems are easier to solve.

Takeaway: All 3 optimization models are solving the same problem, but quadratic is easier to solve:

  1. $\max \{ \frac{2}{||w||}\}$
  2. $\min\{ \frac{1}{2}||w||^2\}$
  3. Bottom $\max \{W(\alpha)\}$ equation in image above (complicated, trust instructors on this one)

SVM Continued: Solving for $w$

Goal: After finding the $\alpha$ values that maximize the quadratic program, you can solve for $w$, ie. the parameters of the hyperplane. Once we find $w$, we can solve for $b$ by plugging in $w$ and a support vector $x_i$.

Some considerations:

  • Most alphas are zero (per the instructors), which means that $x$'s with zero alphas don't contribute to $w$ at all.
  • Put another way: only a few $x$'s (those with non-zero alphas) actually contribute to $w$. These data points are the support vectors.

SL7: Computational Learning Theory

SL8: VC Dimensions

SL9: Bayesian Learning

SL10: Bayesian Inference


End of Midterm 1 material


UL1: Randomized Optimization

UL2: Clustering
